Project: ReCell Used phone prediction

By: Syeda Ambreen Karim Bokhari

Problem Statement

Background:

Buying and selling used smartphones used to be something that happened on a handful of online marketplace sites. But the used and refurbished phone market has grown considerably over the past decade, and a new IDC (International Data Corporation) forecast predicts that the used phone market would be worth $52.7bn by 2023 with a compound annual growth rate (CAGR) of 13.6% from 2018 to 2023. This growth can be attributed to an uptick in demand for used smartphones that offer considerable savings compared with new models. Refurbished and used devices continue to provide cost-effective alternatives to both consumers and businesses that are looking to save money when purchasing a smartphone. There are plenty of other benefits associated with the used smartphone market.

Data Dictionary:

• brand_name: Name of manufacturing brand
• os: OS on which the phone runs
• screen_size: Size of the screen in cm
• 4g: Whether 4G is available or not
• 5g: Whether 5G is available or not
• main_camera_mp: Resolution of the rear camera in megapixels
• selfie_camera_mp: Resolution of the front camera in megapixels
• int_memory: Amount of internal memory (ROM) in GB
• ram: Amount of RAM in GB
• battery: Energy capacity of the phone battery in mAh
• weight: Weight of the phone in grams
• release_year: Year when the phone model was released
• days_used: Number of days the used/refurbished phone has been used
• new_price: Price of a new phone of the same model in euros
• used_price: Price of the used/refurbished phone in euros

Objective:

We need to create an ML-based solution to develop a dynamic pricing strategy for used and refurbished smartphones.

We need to explore the dataset and extract insights from the data. The used smartphone market offers the following benefits:

  1. Refurbished and used devices continue to provide cost-effective alternatives to both consumers and businesses that are looking to save money when purchasing a smartphone.
  2. Used and refurbished devices can be sold with warranties and can also be insured with proof of purchase.
  3. Third-party vendors/platforms, such as Verizon, Amazon, etc., provide attractive offers to customers for refurbished smartphones.
  4. Maximizing the longevity of mobile phones through second-hand trade also reduces their environmental impact and helps in recycling and reducing waste.
  5. The impact of the COVID-19 outbreak may further boost the cheaper refurbished smartphone segment, as consumers cut back on discretionary spending and buy phones only for immediate needs.

Importing necessary libraries and data

Importing Data

Data Overview

Observations:
There are missing values in the following columns:

  1. main_camera_mp : 180
  2. selfie_camera_mp : 2
  3. int_memory : 10
  4. ram : 10
  5. battery : 6
  6. weight : 7

Exploratory Data Analysis

Data Preprocessing

Unique values in Categorical data

battery has a very high maximum value, but many rows may have high battery values. Let's check the lowest battery values.

Sanity checks

After looking up phone specifications online, I found that:
The smallest screen size on a feature phone was 1.7 inches, which is 4.32 cm.
The smallest screen size I found on a smartphone was 3 inches, which is 7.62 cm.
The biggest screen size I found was 8.01 inches, on a foldable phone, which is 20.35 cm.
Phones with a screen_size above 20.35 cm (8.01 inches) or below 4.32 cm (1.7 inches) may therefore have an incorrect screen_size value.
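The conversion and the resulting plausibility range can be checked with a short sketch. The data frame below is made-up illustration data, and the names `INCH_TO_CM`, `min_cm`, `max_cm`, and `suspicious` are hypothetical, not taken from the notebook.

```python
import pandas as pd

INCH_TO_CM = 2.54  # exact conversion factor

# Bounds found by manual lookup (see text), converted from inches to cm.
min_cm = round(1.7 * INCH_TO_CM, 2)   # smallest feature-phone screen
max_cm = round(8.01 * INCH_TO_CM, 2)  # largest foldable screen

# Toy data for illustration only; not rows from the ReCell dataset.
df = pd.DataFrame({"screen_size": [3.0, 10.16, 15.24, 20.35, 30.71]})

# Flag rows whose screen_size falls outside the plausible range.
suspicious = df[(df["screen_size"] < min_cm) | (df["screen_size"] > max_cm)]
print(min_cm, max_cm)
print(suspicious)
```

Flagged rows are only candidates for review; whether to correct, drop, or keep them is a separate decision.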

Exploratory Data Analysis (EDA)

Questions:

  1. What does the distribution of used phone prices look like?
  2. What percentage of the used phone market is dominated by Android devices?
  3. The amount of RAM is important for the smooth functioning of a phone. How does the amount of RAM vary with the brand?
  4. A large battery often increases a phone's weight, making it feel uncomfortable in the hands. How does the weight vary for phones offering large batteries (more than 4500 mAh)?
  5. Bigger screens are desirable for entertainment purposes as they offer a better viewing experience. How many phones are available across different brands with a screen size larger than 6 inches?
  6. Budget phones nowadays offer great selfie cameras, allowing us to capture our favorite moments with loved ones. What is the distribution of budget phones offering greater than 8MP selfie cameras across brands?
  7. Which attributes are highly correlated with the used phone price?

Univariate Analysis of Numeric Variables

Q1. What does the distribution of used phone prices look like?

Observations

Observations

Univariate Analysis of Categorical Variables

Mean used_price according to brand and operating system

Mean used_price according to brand

Count plots of brand_name, os, 4g and 5g

Q2. What percentage of the used phone market is dominated by Android devices?
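One way to answer this kind of question is a normalized value count on the os column. The sketch below uses made-up data, so the printed shares are not the dataset's; `os_share` is a hypothetical name.

```python
import pandas as pd

# Toy data for illustration only; not rows from the ReCell dataset.
df = pd.DataFrame({"os": ["Android"] * 9 + ["iOS"]})

# Share of each operating system, as a percentage of all rows.
os_share = df["os"].value_counts(normalize=True).mul(100).round(1)
print(os_share)
```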

4G phones

5G phones

Observations

Missing Values Treatment

Bivariate & Multivariate Analysis of Categorical Variables

Q3. The amount of RAM is important for the smooth functioning of a phone. How does the amount of RAM vary with the brand?

Q4. A large battery often increases a phone's weight, making it feel uncomfortable in the hands. How does the weight vary for phones offering large batteries (more than 4500 mAh)?

Q5. Bigger screens are desirable for entertainment purposes as they offer a better viewing experience. How many phones are available across different brands with a screen size larger than 6 inches?
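Because screen_size is recorded in cm, the 6-inch threshold has to be converted before filtering. A minimal sketch, assuming made-up rows and the hypothetical names `threshold_cm`, `large`, and `large_by_brand`:

```python
import pandas as pd

# screen_size is stored in cm, so convert the 6-inch threshold first.
threshold_cm = 6 * 2.54  # 15.24 cm

# Toy data for illustration only; not rows from the ReCell dataset.
df = pd.DataFrame({
    "brand_name": ["Samsung", "Samsung", "Huawei", "Nokia", "Huawei"],
    "screen_size": [16.0, 14.5, 15.9, 12.7, 17.8],
})

# Keep phones above the threshold, then count them per brand.
large = df[df["screen_size"] > threshold_cm]
large_by_brand = large["brand_name"].value_counts()
print(large_by_brand)
```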

Q6. Budget phones nowadays offer great selfie cameras, allowing us to capture our favorite moments with loved ones. What is the distribution of budget phones offering greater than 8MP selfie cameras across brands?

Q7. Which attributes are highly correlated with the used phone price?

Observations:

  1. new_price and used_price have a high positive correlation: 0.93
  2. release_year and selfie_camera_mp show a moderately high positive correlation: 0.7
  3. weight and screen_size: 0.63; weight and battery: 0.7
  4. battery and screen_size also have a high correlation: 0.74
  5. days_used has a moderately high negative correlation with selfie_camera_mp: -0.56
  6. used_price has a somewhat negative correlation with days_used: -0.47
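Correlations like these are typically read off a Pearson correlation matrix. The sketch below ranks predictors by the strength of their correlation with used_price; the numbers are made up and will not reproduce the values listed above, and `corr_with_target` is a hypothetical name.

```python
import pandas as pd

# Toy data for illustration only; not rows from the ReCell dataset.
df = pd.DataFrame({
    "new_price": [100.0, 200.0, 300.0, 400.0, 500.0],
    "days_used": [900, 700, 650, 400, 200],
    "used_price": [40.0, 95.0, 140.0, 210.0, 290.0],
})

# Pearson correlation of each predictor with the target.
corr = df.corr()["used_price"].drop("used_price")

# Order by absolute strength while keeping the sign of each correlation.
corr_with_target = corr.reindex(corr.abs().sort_values(ascending=False).index)
print(corr_with_target)
```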

Q8. Consumers cut back on discretionary spending and buy phones only for immediate needs. What percentage of phones are budget phones, with a used_price of 500 or less?
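A possible way to compute such shares is to bin used_price into tiers. The sketch assumes the 250 and 500 euro cut-offs that the observational summary uses for low, medium, and high budget phones; the data and the names `tiers`, `tier_counts`, and `tier_share` are illustrative only.

```python
import pandas as pd

# Toy data for illustration only; not rows from the ReCell dataset.
df = pd.DataFrame({"used_price": [45.0, 120.0, 250.0, 310.0, 500.0, 780.0]})

# Right-closed bins: (0, 250], (250, 500], (500, inf).
tiers = pd.cut(
    df["used_price"],
    bins=[0, 250, 500, float("inf")],
    labels=["low", "medium", "high"],
)

tier_counts = tiers.value_counts()
tier_share = tiers.value_counts(normalize=True).mul(100).round(1)
print(tier_share)
```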

Outlier Detection and Treatment

Let's treat outliers in the data by flooring and capping.

All of the points in each plot are drawn from the exact same distribution, so it's not fair to call any of the points outliers in the sense of there being bad data. But depending on the distribution in question, we may have almost all of the z-scores between -3 and 3 or instead there could be extremely large values.
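Flooring and capping can be sketched as clipping each value to the boxplot whiskers. This assumes the common 1.5 × IQR rule, which the text does not spell out; `floor_and_cap` is a hypothetical helper, not the notebook's function.

```python
import pandas as pd

def floor_and_cap(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values to [Q1 - k*IQR, Q3 + k*IQR] (assumed whisker rule)."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

# Toy data for illustration only: one extreme value on the right.
weight = pd.Series([140.0, 150.0, 160.0, 170.0, 180.0, 900.0])
treated = floor_and_cap(weight)
print(treated.tolist())
```

Clipping keeps every row and only pulls extreme values in to the whisker, which is why it is often preferred over dropping rows when the extreme values are genuine.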

Treating Categorical Variable: brand_name

Outlier Treatment on Numeric Variables: ram, screen_size and weight

1. ram

Most ram values are 4, so the IQR is 0 and IQR-based capping would remove all of the variance. Instead, I filter the data frame to keep only the rows whose ram value lies within 3 standard deviations of the mean.
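The 3-standard-deviation filter can be sketched as follows. The data is made up, and `within_3_sd` and `filtered` are hypothetical names.

```python
import pandas as pd

# Toy data for illustration only: mostly 4 GB with one extreme value.
df = pd.DataFrame({"ram": [4.0] * 20 + [64.0]})

mean, std = df["ram"].mean(), df["ram"].std()

# Keep rows whose ram lies within 3 standard deviations of the mean.
within_3_sd = (df["ram"] - mean).abs() <= 3 * std
filtered = df[within_3_sd]
print(len(df), len(filtered))
```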

2. screen_size

screen_size has many values that the boxplot flags as suspicious. The histogram shows that the distribution is skewed, and these very large values are not consistent with the overall distribution of the data. Because the tail is heavy, we might want to consider statistics that are less sensitive to large values; for example, the median may be a better measure of central tendency.

3. weight

weight has many values that the boxplot flags as suspicious, but the histogram shows that the distribution is simply skewed. Because the tail is heavy, we might want to consider statistics that are less sensitive to large values; for example, the median may be a better measure of central tendency.

Treating Outliers of the rest of the numeric variables

Standardizing continuous features

Standard scaler

Observations

Encoding rest of Categorical Variable

EDA

Building a Linear Regression model

Splitting target variable from Predictors

Normalised data of predictor variables

Standardizing numeric variables using standard scaler

Fitting Linear Regression model on train data set

Coefficients and intercept

Model performance evaluation

Let's check the performance of the model using different metrics.

We will be using metric functions defined in sklearn for RMSE, MAE, and $R^2$.

We will define a function to calculate MAPE and adjusted $R^2$.

The mean absolute percentage error (MAPE) measures the accuracy of predictions as a percentage. It is calculated as the average, over all observations, of the absolute difference between the actual and predicted values divided by the actual value. It works best if there are no extreme values in the data and none of the actual values are 0. We will create a function which will print out all the above metrics in one go.
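A minimal sketch of such a function is shown here. To stay self-contained it computes RMSE and MAE with NumPy rather than the sklearn metric functions mentioned above, and the names `mape`, `adjusted_r2`, and `regression_report` are hypothetical, not the notebook's.

```python
import numpy as np

def mape(actual, predicted):
    """Mean absolute percentage error in percent; undefined if any actual is 0."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean(np.abs((actual - predicted) / actual)) * 100)

def adjusted_r2(actual, predicted, n_predictors):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1)."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    ss_res = np.sum((actual - predicted) ** 2)
    ss_tot = np.sum((actual - actual.mean()) ** 2)
    r2 = 1 - ss_res / ss_tot
    n = actual.size
    return float(1 - (1 - r2) * (n - 1) / (n - n_predictors - 1))

def regression_report(actual, predicted, n_predictors):
    """Print and return RMSE, MAE, MAPE, and adjusted R^2."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    errors = actual - predicted
    report = {
        "RMSE": float(np.sqrt(np.mean(errors ** 2))),
        "MAE": float(np.mean(np.abs(errors))),
        "MAPE": mape(actual, predicted),
        "Adj. R2": adjusted_r2(actual, predicted, n_predictors),
    }
    for name, value in report.items():
        print(f"{name}: {value:.3f}")
    return report

# Toy values for illustration only; not model output from the notebook.
report = regression_report([100, 200, 300, 400], [110, 190, 330, 400], n_predictors=1)
```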

Observations

The training $R^2$ is 95.6%, indicating that the model explains 95.6% of the variation in the train data. So, the model is not underfitting.

MAE and RMSE on the train and test sets are comparable, which shows that the model is not overfitting.

MAE indicates that our current model is able to predict used phone prices within a mean error of ~10 Euros on the test data.

MAPE on the test set suggests we can predict within ~20.6% of the used phone prices.

Linear Regression using statsmodels

Let's build a linear regression model using statsmodels.

Predictors with a p-value greater than 0.05 are not statistically significant and can be removed.

Observations

Checking Linear Regression Assumptions

We will be checking the following Linear Regression assumptions:

  1. No Multicollinearity

  2. Linearity of variables

  3. Independence of error terms

  4. Normality of error terms

  5. No Heteroscedasticity

TEST FOR MULTICOLLINEARITY

Detecting Multicollinearity with VIF

release_year has a VIF slightly greater than 5. However, it is a temporal variable, so some correlation with the other variables is expected.

1. Removing Multicollinearity

To remove multicollinearity:

  1. Drop every column that has a VIF score greater than 5, one at a time.
  2. Look at the adjusted R-squared and RMSE of all these models.
  3. Drop the variable that makes the least change in adjusted R-squared.
  4. Check the VIF scores again.
  5. Continue until all VIF scores are under 5.

Let's define a function that will help us do this.
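The VIF scores used in these steps can be sketched with plain NumPy: each predictor is regressed on the remaining predictors plus an intercept, and VIF = 1 / (1 - R²). This is an illustrative stand-in, not the notebook's function (statsmodels offers `variance_inflation_factor` for the same purpose); `vif_table` and the synthetic columns are hypothetical. It covers only the VIF calculation, not the drop-and-refit loop.

```python
import numpy as np
import pandas as pd

def vif_table(X: pd.DataFrame) -> pd.Series:
    """VIF for each column, regressing it on the other columns plus an intercept."""
    vifs = {}
    for col in X.columns:
        y = X[col].to_numpy(dtype=float)
        others = X.drop(columns=col).to_numpy(dtype=float)
        A = np.column_stack([np.ones(len(X)), others])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
        vifs[col] = 1 / (1 - r2)
    return pd.Series(vifs, name="VIF")

# Synthetic data for illustration only: a_copy is nearly a duplicate of a.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
X = pd.DataFrame({"a": a, "b": b, "a_copy": a + 0.1 * rng.normal(size=200)})

vif = vif_table(X)
print(vif.round(2))
```

Columns a and a_copy should show VIF scores well above 5, while b should stay close to 1.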

The above predictors have no multicollinearity and the assumption is satisfied.

Let's check the model performance.

Regression Evaluation Metrics

Regression Model will be evaluated using the following evaluation metrics:

Mean Absolute Error (MAE): $$\frac 1n\sum_{i=1}^n|y_i-\hat{y}_i|$$

Mean Squared Error (MSE): $$\frac 1n\sum_{i=1}^n(y_i-\hat{y}_i)^2$$

Root Mean Squared Error (RMSE): $$\sqrt{\frac 1n\sum_{i=1}^n(y_i-\hat{y}_i)^2}$$

Observations

Now no feature has a p-value greater than 0.05, so we'll consider the features in x_train4 as the final ones and olsmod2 as the final model.

Observations

Other Assumptions of Linear regression

Now we'll check the rest of the assumptions on olsmod2.

  1. Linearity of variables & Independence of error terms

  2. Normality of error terms

  3. No Heteroscedasticity

1. Linearity of variables & Independence of error terms

2. TEST FOR NORMALITY

3. TEST FOR HOMOSCEDASTICITY

Fitting Linear Regression model on train data set

The model is able to explain 95.6% of the variation in the data, which is very good.

The train and test RMSE and MAE are comparable. So, our model is not suffering from overfitting.

The MAPE on the test set suggests we can predict within 20.7% of the used phone prices.

Hence, we can conclude the model olsmod2 is good for prediction as well as inference purposes.

Let's compare the initial model created with sklearn and the final statsmodels model.

Final Model Summary

Observational summary:

  1. What does the distribution of used phone prices look like?
    Observations

    • used_price is the target variable.
    • The distribution of used_price is heavily skewed to the right.
    • The outliers to the right indicate that many cell phones, though used, have very high prices.
    • The mean is 109.9, much greater than the median, due to extreme values at the higher end.
    • There are many outliers on the right (higher) side. About 50% of the values lie between 45 and 126.
  2. What percentage of the used phone market is dominated by Android devices?

    • Android devices dominate the used phone market with a 90.9% share; the other operating systems have very small shares.
    • Windows accounts for 1.9%, iOS for 1.6%, and other operating systems for 5.7%.
      • 4G is available on 66.1% of phones; 33.9% do not have 4G.
      • 5G is available on only 4.3% of phones; 95.7% do not have 5G. The 5G phones could be newer models.
  3. The amount of RAM is important for the smooth functioning of a phone. How does the amount of RAM vary with the brand?
    • RAM varies across brands in the range of 1 to 6 GB, with about 4 GB on average.
  4. A large battery often increases a phone's weight, making it feel uncomfortable in the hands. How does the weight vary for phones offering large batteries (more than 4500 mAh)?
    • On average, phones with a higher battery capacity also weigh more.
  5. Bigger screens are desirable for entertainment purposes as they offer a better viewing experience. How many phones are available across different brands with a screen size larger than 6 inches?
    • Average screen sizes are between 6" and 8".
    • Some very large screen sizes are also present, such as BlackBerry at 11"+ and Microsoft at 10"+, which may be errors in the data.
  6. Budget phones nowadays offer great selfie cameras, allowing us to capture our favorite moments with loved ones. What is the distribution of budget phones offering greater than 8MP selfie cameras across brands?
    • Statistics for phones with selfie_camera_mp > 8 MP:
    • The maximum is 32 MP.
    • The mean is 18.7 MP.
    • 50% of these phones have under 16 MP.
  7. Which attributes are highly correlated with the used phone price?

    1. new_price and used_price have a high positive correlation: 0.93
    2. used_price has a somewhat negative correlation with days_used: -0.47
    • Other significant correlations:
      1. release_year and selfie_camera_mp show a moderately high positive correlation: 0.7
      2. weight and screen_size: 0.63; weight and battery: 0.7
      3. battery and screen_size also have a high correlation: 0.74
      4. days_used has a moderately high negative correlation with selfie_camera_mp: -0.56
  8. Consumers cut back on discretionary spending and buy phones only for immediate needs. What percentage of phones are budget phones, with a used_price of 500 or less?
    • Majority: 92.1% (3289 out of 3571) of phones are low-budget phones (price of 250 or less).
    • 6.0% (215 out of 3571) of phones are medium-budget phones (price above 250 and up to 500).
    • Minority: only 1.9% (67 out of 3571) of phones are high-budget phones (price above 500).

Conclusion:

The final predictor variables for the model are:

All these variables have a p-value below 0.05, so we reject the null hypothesis that they are insignificant.

Actionable Insights and Recommendations
